{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "53cbf8c7-5e33-4a0e-b10e-33da5127c6af",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-01T07:46:03.804799Z",
     "iopub.status.busy": "2024-10-01T07:46:03.803991Z",
     "iopub.status.idle": "2024-10-01T07:46:03.828404Z",
     "shell.execute_reply": "2024-10-01T07:46:03.827755Z",
     "shell.execute_reply.started": "2024-10-01T07:46:03.804776Z"
    },
    "tags": []
   },
   "source": [
    "# Aria Inference Recipes\n",
    "\n",
    "## Section 3: Multi-page PDF Understanding\n",
    "\n",
    "We here show the best recipes to understand a multi-page PDF (e.g. ArXiv papers, financial reports, slides, scanned books) with Aria model. We use the paper of [LongVideoBench](https://arxiv.org/pdf/2407.15754) (Jul 24', ^1) as an example, to show an end-to-end tutorial from a `.pdf` file to various types of responses. \n",
    "\n",
    "By default, we use split-image settings as images in PDFs are information-rich.\n",
    "\n",
    "^1: As per knowledge cutoff of the model, this paper has never been seen during training.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81989981-a740-4597-99e4-a02f1f99b9da",
   "metadata": {
    "tags": []
   },
   "source": [
    "\n",
    "### [General] Load Model and Processor\n",
    "\n",
    "As the input size rapidly increases, we load the model and understand images with two 80GB GPUs (GPU 0, 1). If you find an OOM error, please try to let the model to see more GPUs. The `device_map=\"auto\"` parameter will automatically shard model parameters to all visible GPUs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c6b3a68b-a029-4df4-86fc-4d74997d5109",
   "metadata": {
    "ExecutionIndicator": {
     "show": false
    },
    "execution": {
     "iopub.execute_input": "2024-10-01T17:26:48.723198Z",
     "iopub.status.busy": "2024-10-01T17:26:48.722842Z",
     "iopub.status.idle": "2024-10-01T17:27:32.664204Z",
     "shell.execute_reply": "2024-10-01T17:27:32.662867Z",
     "shell.execute_reply.started": "2024-10-01T17:26:48.723177Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/root/miniconda3/envs/aria/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "AriaMoELMForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.\n",
      "  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes\n",
      "  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).\n",
      "  - If you are not the owner of the model architecture class, please contact the model code owner to update it.\n",
      "Loading checkpoint shards: 100%|██████████| 12/12 [00:36<00:00,  3.03s/it]\n"
     ]
    }
   ],
   "source": [
    "# load Aria model & processor\n",
    "\n",
    "import os\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1\"\n",
    "\n",
    "import requests\n",
    "import torch\n",
    "from PIL import Image\n",
    "from transformers import AutoModelForCausalLM, AutoProcessor\n",
    "\n",
    "model_id_or_path = \"rhymes-ai/Aria\"\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map=\"auto\", torch_dtype=torch.bfloat16, trust_remote_code=True)\n",
    "\n",
    "processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e31f164-08fa-403c-8e7d-4be4e1063f56",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "a90e837d-1f82-419f-b562-8c188150e107",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Installing PyMUPDF & Defining PDF2Image Function\n",
    "\n",
    "To convert a PDF into images, we use the PyMUPDF package. We install it as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74b4ffc7-8ceb-460b-b3cc-206180c3ef7e",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%sh\n",
    "pip uninstall PyMUPDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "553aac05-4a13-4fff-a57c-ef78a32867ca",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-02T03:38:31.207892Z",
     "iopub.status.busy": "2024-10-02T03:38:31.206963Z",
     "iopub.status.idle": "2024-10-02T03:38:31.328064Z",
     "shell.execute_reply": "2024-10-02T03:38:31.327278Z",
     "shell.execute_reply.started": "2024-10-02T03:38:31.207870Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import fitz  # PyMuPDF\n",
    "from PIL import Image, ImageFile\n",
    "\n",
    "def pdf_to_images(pdf_path):\n",
    "    # Open the PDF file using PyMuPDF\n",
    "    doc = fitz.open(pdf_path)\n",
    "    \n",
    "    # Store each page as a PIL image\n",
    "    images = []\n",
    "    \n",
    "    for page_num in range(doc.page_count):\n",
    "        page = doc.load_page(page_num)\n",
    "        \n",
    "        # Convert page to a pixmap (image representation in PyMuPDF)\n",
    "        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))\n",
    "        \n",
    "        # Create a PIL image from the pixmap's byte data\n",
    "        img = Image.frombytes(\"RGB\", [pix.width, pix.height], pix.samples)\n",
    "        images.append(img)\n",
    "    \n",
    "    doc.close()\n",
    "    \n",
    "    return images\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "22c536b1-7886-4ea3-a6fd-94277400f1f4",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-02T03:38:34.124560Z",
     "iopub.status.busy": "2024-10-02T03:38:34.123692Z",
     "iopub.status.idle": "2024-10-02T03:38:35.672041Z",
     "shell.execute_reply": "2024-10-02T03:38:35.671342Z",
     "shell.execute_reply.started": "2024-10-02T03:38:34.124537Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "images = pdf_to_images(\"visuals/longvideobench.pdf\")[:9] #limit to 9 pages, removing appendix and references"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f36df969-66ec-4681-be73-8141c1cde468",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-01T17:27:34.380782Z",
     "iopub.status.busy": "2024-10-01T17:27:34.380622Z",
     "iopub.status.idle": "2024-10-01T17:27:34.386053Z",
     "shell.execute_reply": "2024-10-01T17:27:34.385185Z",
     "shell.execute_reply.started": "2024-10-01T17:27:34.380765Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from typing import List\n",
    "\n",
    "def get_placeholders_for_multiple_pages(images: List):\n",
    "    contents = []\n",
    "    for i, _ in enumerate(images):\n",
    "        contents.extend(\n",
    "            [\n",
    "                {\"text\": f\"Page {i+1}: \", \"type\": \"text\"},\n",
    "                {\"text\": None, \"type\": \"image\"},\n",
    "                {\"text\": \"\\n\", \"type\": \"text\"}\n",
    "            ]\n",
    "        )\n",
    "    return contents\n",
    "\n",
    "contents = get_placeholders_for_multiple_pages(images)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecbeaf58-cb80-4585-88f5-3e128fd117db",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Pages Visualization\n",
    "\n",
    "Let's visualize the paper as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "b70c6225-c846-4fa6-a130-004905236367",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-02T03:46:54.794382Z",
     "iopub.status.busy": "2024-10-02T03:46:54.791452Z",
     "iopub.status.idle": "2024-10-02T03:46:54.802169Z",
     "shell.execute_reply": "2024-10-02T03:46:54.801563Z",
     "shell.execute_reply.started": "2024-10-02T03:46:54.794337Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from PIL import Image\n",
    "\n",
    "def create_image_gallery(images, columns=3, spacing=20, bg_color=(200, 200, 200)):\n",
    "    \"\"\"\n",
    "    Combine multiple images into a single larger image in a grid format.\n",
    "    \n",
    "    Parameters:\n",
    "        image_paths (list of str): List of file paths to the images to display.\n",
    "        columns (int): Number of columns in the gallery.\n",
    "        spacing (int): Space (in pixels) between the images in the gallery.\n",
    "        bg_color (tuple): Background color of the gallery (R, G, B).\n",
    "    \n",
    "    Returns:\n",
    "        PIL.Image: A single combined image.\n",
    "    \"\"\"\n",
    "    # Open all images and get their sizes\n",
    "    img_width, img_height = images[0].size  # Assuming all images are of the same size\n",
    "\n",
    "    # Calculate rows needed for the gallery\n",
    "    rows = (len(images) + columns - 1) // columns\n",
    "\n",
    "    # Calculate the size of the final gallery image\n",
    "    gallery_width = columns * img_width + (columns - 1) * spacing\n",
    "    gallery_height = rows * img_height + (rows - 1) * spacing\n",
    "\n",
    "    # Create a new image with the calculated size and background color\n",
    "    gallery_image = Image.new('RGB', (gallery_width, gallery_height), bg_color)\n",
    "\n",
    "    # Paste each image into the gallery\n",
    "    for index, img in enumerate(images):\n",
    "        row = index // columns\n",
    "        col = index % columns\n",
    "\n",
    "        x = col * (img_width + spacing)\n",
    "        y = row * (img_height + spacing)\n",
    "\n",
    "        gallery_image.paste(img, (x, y))\n",
    "\n",
    "    return gallery_image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "01f48017-0bed-43b9-9baa-ddb02c3783f1",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-02T05:55:43.142210Z",
     "iopub.status.busy": "2024-10-02T05:55:43.141088Z",
     "iopub.status.idle": "2024-10-02T05:55:43.238415Z",
     "shell.execute_reply": "2024-10-02T05:55:43.237739Z",
     "shell.execute_reply.started": "2024-10-02T05:55:43.142155Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "create_image_gallery(images).save(\"longvideobench_gallery.jpg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3f45e05-3ad9-420a-a459-fd46c35fa004",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 1: Find and Narrate Figures in the Paper\n",
    "\n",
    "The first task is to find and provide a description on all the figures in this paper, which is a non-replaceable ability an LMM has (compared with an OCR + LLM pipeline)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0b9c9faf-5753-49b6-a32a-c32289e60d19",
   "metadata": {
    "ExecutionIndicator": {
     "show": false
    },
    "execution": {
     "iopub.execute_input": "2024-10-01T17:27:34.386859Z",
     "iopub.status.busy": "2024-10-01T17:27:34.386693Z",
     "iopub.status.idle": "2024-10-01T17:28:03.790690Z",
     "shell.execute_reply": "2024-10-01T17:28:03.789848Z",
     "shell.execute_reply.started": "2024-10-01T17:27:34.386842Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_900017/2558006349.py:16: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
      "  with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
      "/root/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:601: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n",
      "  warnings.warn(\n",
      "The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.\n",
      "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "**Figure 1:** This figure illustrates the LONGVIDEOBENCH benchmark, which features referring questions that reference specific video contexts to answer questions about them. The left side (a) shows an example of a referring query where a woman with a red top and black backpack is described, and the reader is asked what changes occur to her backpack. The right side (b) shows a graph comparing the performance of different models (GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo, Gemini-1.5-Flash) on the benchmark as the number of frames increases.\n",
      "\n",
      "**Figure 2:** This figure provides examples of the 17 categories of referring reasoning questions in the LONGVIDEOBENCH. It is divided into two levels: Perception (L1) and Relation (L2). Each category is illustrated with examples, such as identifying objects, events, or attributes in the video context.\n",
      "\n",
      "**Figure 3:** This figure depicts the video and subtitle collection process for LONGVIDEOBENCH. It shows how videos are downloaded, transcribed, and annotated to create high-quality referring questions and answers. The process involves filtering videos, generating subtitles using Whisper-V3-Large, and manual annotation to ensure quality.\n",
      "\n",
      "**Figure 4:** This figure presents the accuracy of proprietary and open-source LMMs with respect to query depth and video duration. It includes heatmaps showing the performance of different models (GPT-4o, Gemini-1.5-Pro, Gemini-1.5-Flash, GPT-4-Turbo) across various video durations and referring query positions within the video.<|im_end|>\n"
     ]
    }
   ],
   "source": [
    "\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            *contents,\n",
    "            {\"text\": \"Please narrate what each Figure (in total 4 Figures) is about in this paper.\", \"type\": \"text\"},\n",
    "        ],\n",
    "    }\n",
    "]\n",
    "\n",
    "text = processor.apply_chat_template(messages, add_generation_prompt=True)\n",
    "inputs = processor(text=text, images=images, return_tensors=\"pt\", split_image=True)\n",
    "inputs[\"pixel_values\"] = inputs[\"pixel_values\"].to(model.dtype)\n",
    "inputs = {k: v.to(model.device) for k, v in inputs.items()}\n",
    "\n",
    "with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
    "    output = model.generate(\n",
    "        **inputs,\n",
    "        max_new_tokens=2048,\n",
    "        stop_strings=[\"<|im_end|>\"],\n",
    "        tokenizer=processor.tokenizer,\n",
    "        do_sample=False,\n",
    "    )\n",
    "    output_ids = output[0][inputs[\"input_ids\"].shape[1]:]\n",
    "    result = processor.decode(output_ids, skip_special_tokens=True)\n",
    "\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85afc553-04bf-4f51-8ee4-677af87a7d58",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 2: Summarize the Paper\n",
    "\n",
    "\n",
    "The second task is to summarize this paper. Ideally, we would like this summarization not only from the abstract / introduction / conclusion parts of it, but also includes many important points that are iterated through this paper. \n",
    "\n",
    "And Aria is able to provide a summarization like that. See the results below and try on more papers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "734b9d86-8120-445c-8583-7c0d511a1b94",
   "metadata": {
    "ExecutionIndicator": {
     "show": false
    },
    "execution": {
     "iopub.execute_input": "2024-10-01T17:28:03.792574Z",
     "iopub.status.busy": "2024-10-01T17:28:03.792268Z",
     "iopub.status.idle": "2024-10-01T17:28:51.396020Z",
     "shell.execute_reply": "2024-10-01T17:28:51.395016Z",
     "shell.execute_reply.started": "2024-10-01T17:28:03.792554Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_900017/3856353875.py:16: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
      "  with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The paper titled \"LONGVIDEOBENCH: A Benchmark for Long-context Interleaved Video-Language Understanding\" introduces a comprehensive benchmark designed to evaluate the performance of Large Multimodal Models (LMMs) in understanding long-duration videos. The authors, Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li, highlight the limitations of existing benchmarks that primarily focus on short videos and do not adequately test the capabilities of LMMs to handle long-context multimodal inputs.\n",
      "\n",
      "The paper is structured into several sections:\n",
      "\n",
      "1. **Introduction**:\n",
      "   - Discusses the growth in the processing capabilities of foundation models, which now handle longer contexts.\n",
      "   - Emphasizes the need for benchmarks that can evaluate these models on long-duration videos.\n",
      "   - Introduces LONGVIDEOBENCH as a solution to this gap.\n",
      "\n",
      "2. **The Referring Reasoning Task**:\n",
      "   - Identifies the primary challenges in multimodal long-context understanding.\n",
      "   - Defines the referring reasoning task, which involves answering questions that reference specific video contexts.\n",
      "   - Divides the task into two levels: Perception (L1) and Relation (L2), with examples provided for each.\n",
      "\n",
      "3. **Dataset Construction**:\n",
      "   - Describes the process of collecting and creating interleaved video-subtitle data.\n",
      "   - Details the categorization of videos into different duration groups and the collection of high-quality referring questions and answers.\n",
      "\n",
      "4. **Annotating Questions and Answers**:\n",
      "   - Explains the annotation process, including the roles of primary annotators, examiners, and revisers.\n",
      "   - Ensures high-quality annotations by requiring explicit inclusion of referred queries and uniform labeling of frame indices.\n",
      "\n",
      "5. **Evaluation of LONGVIDEOBENCH**:\n",
      "   - Includes a total of 22 LMMs for evaluation, comprising both proprietary and open-source models.\n",
      "   - Provides detailed results on the performance of these models across different duration groups and question categories.\n",
      "   - Highlights the challenges faced by LMMs in understanding long videos and the impact of visual modality.\n",
      "\n",
      "6. **Main Results**:\n",
      "   - Proprietary models like GPT-4o and Gemini-1.5-Pro show significant improvements with longer input lengths.\n",
      "   - Open-source models lag behind and struggle to improve results with more frames.\n",
      "   - Visual modality is crucial, and removing subtitles leads to worse results.\n",
      "\n",
      "7. **Leaderboard**:\n",
      "   - Presents a leaderboard showing the test set results of the evaluated LMMs.\n",
      "   - Discusses the performance variations across different models and duration groups.\n",
      "\n",
      "8. **Conclusion**:\n",
      "   - Summarizes the contributions of LONGVIDEOBENCH in addressing the challenges of evaluating LMMs on long-context multimodal inputs.\n",
      "   - Emphasizes the benchmark's value in providing insights into the deficiencies of existing models and guiding future research directions.\n",
      "\n",
      "The paper concludes by positioning LONGVIDEOBENCH as a valuable tool for advancing the capabilities of LMMs in handling long-duration videos and complex multimodal reasoning tasks.<|im_end|>\n"
     ]
    }
   ],
   "source": [
    "\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            *contents,\n",
    "            {\"text\": \"Please provide an in-detail summary of the paper.\", \"type\": \"text\"},\n",
    "        ],\n",
    "    }\n",
    "]\n",
    "\n",
    "text = processor.apply_chat_template(messages, add_generation_prompt=True)\n",
    "inputs = processor(text=text, images=images, return_tensors=\"pt\", split_image=True)\n",
    "inputs[\"pixel_values\"] = inputs[\"pixel_values\"].to(model.dtype)\n",
    "inputs = {k: v.to(model.device) for k, v in inputs.items()}\n",
    "\n",
    "with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
    "    output = model.generate(\n",
    "        **inputs,\n",
    "        max_new_tokens=2048,\n",
    "        stop_strings=[\"<|im_end|>\"],\n",
    "        tokenizer=processor.tokenizer,\n",
    "        do_sample=False,\n",
    "    )\n",
    "    output_ids = output[0][inputs[\"input_ids\"].shape[1]:]\n",
    "    result = processor.decode(output_ids, skip_special_tokens=True)\n",
    "\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e28d721-94c8-40b3-be7e-1542bb07f962",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 3: Detailed Question-Answering\n",
    "\n",
    "As the third task, we provide an example for Aria to ask some detail-related question that are in the middle of this paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "dcab3d93-73bc-4181-a302-370b741dcfc4",
   "metadata": {
    "ExecutionIndicator": {
     "show": false
    },
    "execution": {
     "iopub.execute_input": "2024-10-01T17:28:51.396925Z",
     "iopub.status.busy": "2024-10-01T17:28:51.396742Z",
     "iopub.status.idle": "2024-10-01T17:29:07.574374Z",
     "shell.execute_reply": "2024-10-01T17:29:07.573648Z",
     "shell.execute_reply.started": "2024-10-01T17:28:51.396907Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_900017/4138294924.py:16: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
      "  with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "The two major difficulties in understanding long videos, as outlined in the paper, are:\n",
      "\n",
      "\\begin{enumerate}\n",
      "    \\item \\textbf{Retrieving details from long videos:} Existing Large Multimodal Models (LMMs) often struggle to extract specific details from long sequences. To accurately assess tasks in LONGVIDEOBENCH, a focus on granular details such as objects, events, or attributes is necessary, rather than a summary or topic overview.\n",
      "    \\item \\textbf{Reasoning contextual relations in long videos:} Beyond mere retrieval, it is significantly challenging for LMMs to reason about the relationships among different elements within a long video. Questions in LONGVIDEOBENCH are designed to compel LMMs to interpret the interconnections among diverse context clues spread across the video, necessitating a deep understanding of the temporal and contextual dynamics.\n",
      "\\end{enumerate}<|im_end|>\n"
     ]
    }
   ],
   "source": [
    "\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            *contents,\n",
    "            {\"text\": \"According to the paper, what are the two major difficulties in understanding long videos? Reply me in Latex format.\", \"type\": \"text\"},\n",
    "        ],\n",
    "    }\n",
    "]\n",
    "\n",
    "text = processor.apply_chat_template(messages, add_generation_prompt=True)\n",
    "inputs = processor(text=text, images=images, return_tensors=\"pt\", split_image=True)\n",
    "inputs[\"pixel_values\"] = inputs[\"pixel_values\"].to(model.dtype)\n",
    "inputs = {k: v.to(model.device) for k, v in inputs.items()}\n",
    "\n",
    "with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
    "    output = model.generate(\n",
    "        **inputs,\n",
    "        max_new_tokens=512,\n",
    "        stop_strings=[\"<|im_end|>\"],\n",
    "        tokenizer=processor.tokenizer,\n",
    "        do_sample=False,\n",
    "    )\n",
    "    output_ids = output[0][inputs[\"input_ids\"].shape[1]:]\n",
    "    result = processor.decode(output_ids, skip_special_tokens=True)\n",
    "    \n",
    "print(result)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "aria",
   "language": "python",
   "name": "aria"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
